Docs: encryption #14621

Merged
huaxingao merged 9 commits into apache:main from ggershinsky:encr-doc
Dec 3, 2025
Conversation

@ggershinsky
Contributor

No description provided.

@github-actions github-actions bot added the docs label Nov 19, 2025
@ggershinsky
Contributor Author

cc @huaxingao

Comment on lines 74 to 77
To function properly, Iceberg table encryption places the following requirements on the catalogs:

1. For protection of table data confidentiality, the table encryption properties (`encryption.key-id` and an optional `encryption.data-key-length`) must be kept in a tamper-proof storage or in a trusted independent database. Catalogs must not retrieve these properties directly from the metadata.json, if this file is kept unprotected in a storage vulnerable to tampering.
2. For protection of table integrity, the metadata json must be kept in a tamper-proof storage or in a trusted independent object store. Catalogs must not retrieve the metadata.json file directly, if it is kept unprotected in a storage vulnerable to tampering.
Contributor

Thank you for this section! It is very helpful in understanding motivations

Realise this may still be in-progress, but just flagging that I'm a bit confused from an IRC implementation perspective about what's required here.

I gather from (2) and the devlist thread that for untrusted storage (I'd imagine most IRCs currently write metadata JSONs in the same object storage as data files), an IRC should tamper-check a metadata JSON before returning its TableMetadata spec object - however, it then feels that (1) is not necessary anymore.

Contributor Author

Yep, this can be simplified.
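
For readers following the thread, the tamper check discussed above could look roughly like this: the catalog records a SHA-256 digest of each `metadata.json` in its own trusted database at commit time, then recomputes and compares the digest before returning the table metadata. This is only a sketch under those assumptions. The function and exception names are hypothetical, and it is not the mechanism defined by this PR or by Iceberg.

```python
import hashlib
import hmac


class MetadataTamperedError(Exception):
    """Raised when metadata.json bytes do not match the trusted digest."""


def digest(metadata_bytes: bytes) -> str:
    """Hex SHA-256 digest the catalog would store in its trusted database."""
    return hashlib.sha256(metadata_bytes).hexdigest()


def load_verified_metadata(metadata_bytes: bytes, trusted_digest: str) -> bytes:
    """Return the metadata bytes only if they match the trusted digest.

    `metadata_bytes` is what was read from untrusted object storage;
    `trusted_digest` comes from the catalog's own tamper-proof store.
    """
    actual = digest(metadata_bytes)
    # Constant-time comparison of the two hex strings.
    if not hmac.compare_digest(actual, trusted_digest):
        raise MetadataTamperedError("metadata.json does not match the trusted digest")
    return metadata_bytes
```

With a check like this in place, the encryption properties parsed from the verified file are covered by the same digest, which is consistent with the observation that requirement (1) becomes redundant once (2) holds.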

1. Catalog property `encryption.kms-impl`, that specifies the class path for a client of a KMS ("key management service").
2. Table property `encryption.key-id`, that specifies the ID of a master key used to encrypt and decrypt the table. Master keys are stored and managed in the KMS.

The `encryption.key-id` must be set during the table creation, and never modified or removed during the table lifetime.

@palladium-coder palladium-coder Dec 1, 2025

Should rest_spec.md also be updated to ensure this requirement in order to prevent accidents?

Contributor Author

This requirement is applicable to all catalogs. Per our recent community discussion, the REST encryption updates should be REST-specific.

Is there a document or spec that tracks how catalogs should behave?

I would prefer to avoid accidents like these, and for any custom catalog to know these behaviours exist.

Contributor Author

We can add a pointer to the catalog security requirements section in custom-catalogs.md.


@palladium-coder palladium-coder Dec 2, 2025

Thanks for the clarification and updating the custom catalog doc.
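
As an illustration of the "set at creation, never modified or removed" rule discussed in this thread, a custom catalog could guard property updates along the following lines. This is a hedged sketch: only the property name `encryption.key-id` comes from the quoted documentation, while the function and exception names are hypothetical and do not correspond to an existing Iceberg API.

```python
ENCRYPTION_KEY_ID = "encryption.key-id"


class ImmutablePropertyError(ValueError):
    """Raised when an update would add, change, or remove the key ID."""


def validate_property_update(current: dict, updated: dict) -> None:
    """Reject any update that touches encryption.key-id after table creation.

    `current` holds the table properties before the update and `updated`
    holds them after it. Adding the key to an existing table, changing its
    value, and removing it all make the two values differ.
    """
    old = current.get(ENCRYPTION_KEY_ID)
    new = updated.get(ENCRYPTION_KEY_ID)
    if old != new:
        raise ImmutablePropertyError(
            f"{ENCRYPTION_KEY_ID} must be set at table creation and never changed "
            f"(was {old!r}, update has {new!r})"
        )
```

A catalog would call such a check on every commit that changes table properties, before persisting the new metadata.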

@huaxingao huaxingao merged commit 8626ef5 into apache:main Dec 3, 2025
4 checks passed
@huaxingao
Contributor

Thanks @ggershinsky for the PR! Thanks @palladium-coder @smaheshwar-pltr for the review!

@manuzhang
Member

manuzhang commented Dec 4, 2025

@ggershinsky This doesn't render correctly. Can you check and fix? It would be best to share a snapshot of the page in the PR from local doc build.

[Screenshot attached: CleanShot 2025-12-04 at 09 50 52@2x]

@huaxingao
Contributor

I fixed the rendering problem in #14756

thomaschow pushed a commit to thomaschow/iceberg that referenced this pull request Jan 19, 2026
* initial commit

* clean up

* brief how it works section

* clean up

* add refs

* discussion updates

* address review comments

* add ref to custom catalogs doc

* add line break

Iceberg table encryption protects confidentiality and integrity of table data in an untrusted storage. The `data`, `delete`, `manifest` and `manifest list` files are encrypted and tamper-proofed before being sent to the storage backend.

The `metadata.json` file does not contain data or stats, and is therefore not encrypted.
Contributor

I believe this is not true. Please check the discussion of stats contained in the metadata.json in #14502 (comment), which includes thoughts on a possible exploit.

cc @ggershinsky

Contributor Author

Thanks @singhpk234, I'll have a look.

Contributor Author

Thanks again @singhpk234, can you please clarify a few points:

  1. Is the following true? The partition summary is not part of the snapshot summary in the Iceberg spec (https://iceberg.apache.org/spec/#optional-snapshot-summary-fields), but in the implementation it is sometimes added to the snapshots and can contain data stats.
  2. If yes, when are the partition summaries enabled, and when do they contain stats? Is any of this under the writing user's control?
  3. Can you give an example of such a use case (one that writes stats to the metadata.json)?

Contributor

Thank you for taking a look, @ggershinsky!

  1. The partition summary, and indeed any summary properties, are supposed to be optional and are not defined in the spec. In the Iceberg Java implementation they are collected when write.summary.partition-limit is enabled (off by default). Please check this for details: https://iceberg.apache.org/docs/nightly/configuration/#write-properties
  2. Yes, they are enabled by the writer: the user sets the table property write.summary.partition-limit, and engines such as Spark collect the summaries when it is enabled.
  3. Please check https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/SnapshotSummary.java#L202
    At a high level, it can reveal column values and stats such as file counts:
    https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/SnapshotSummary.java#L163

Please let me know what you think.

Contributor Author

Are partition summary fields encapsulated in the SnapshotSummary.UpdateMetrics class?
https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/SnapshotSummary.java#L223

Looks like a set of counters. Are there stats (col min/max) too?
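
To make the concern in this thread concrete, the sketch below scans a `metadata.json` document for per-partition entries in the snapshot summaries. The `snapshots` and `summary` fields follow the table spec. The `partitions.` key prefix is an assumption about how the Java implementation names these entries and should be verified against SnapshotSummary.java. The function name and any sample input are hypothetical and not taken from a real table.

```python
import json

# Assumption: key prefix the Java implementation uses for per-partition
# summary entries; verify against SnapshotSummary.java before relying on it.
PARTITION_SUMMARY_PREFIX = "partitions."


def find_partition_summaries(metadata_json: str) -> dict:
    """Map snapshot-id -> {partition path: summary value} for matching keys.

    Any partition path found here is stored in plaintext in metadata.json,
    so it stays readable even when the data and manifest files are encrypted.
    """
    metadata = json.loads(metadata_json)
    found = {}
    for snapshot in metadata.get("snapshots", []):
        summary = snapshot.get("summary", {})
        leaked = {
            key[len(PARTITION_SUMMARY_PREFIX):]: value
            for key, value in summary.items()
            if key.startswith(PARTITION_SUMMARY_PREFIX)
        }
        if leaked:
            found[snapshot["snapshot-id"]] = leaked
    return found
```

Running a scan like this over the metadata of a table with write.summary.partition-limit enabled would show whether partition values, and the counters attached to them, are exposed in the unencrypted file.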


6 participants